ABSTRACT

We present a novel approach to pseudo-feedback-based ad 
hoc retrieval that uses language models induced from both 
documents and clusters. First, we treat the pseudo-feedback 
documents produced in response to the original query as 
a set of pseudo-queries that themselves can serve as input 
to the retrieval process. Observing that the documents re- 
turned in response to the pseudo-queries can then act as 
pseudo-queries for subsequent rounds, we arrive at a formu- 
lation of pseudo-query-based retrieval as an iterative pro- 
cess. Experiments show that several concrete instantiations 
of this idea, when applied in conjunction with techniques 
designed to heighten precision, yield performance results ri- 
valing those of a number of previously-proposed algorithms, 
including the standard language-modeling approach. The 
use of cluster-based language models is a key contributing 
factor to our algorithms' success. 

Categories and Subject Descriptors: H.3.3 [Informa- 
tion Search and Retrieval]: Retrieval models, Clustering 

General Terms: Algorithms, Experimentation 

Keywords: language modeling, clustering, pseudo-feedback, 
pseudo-queries, rendition, query drift, cluster-based language 
models, aspect recall 

1. INTRODUCTION 

Statistical language models have become an important 
tool in information retrieval, and have been applied to many 
settings [7]. In the case of fully automatic ad hoc IR, where
the task is to find documents relevant to a query q with- 
out access to relevance-feedback information, a great deal 
of recent research builds upon Ponte and Croft's initial pro- 
posal 12 II wherein the rank of a document d is based on the 
probability assigned to q by a language model constructed 
from d. We can gloss this ranking principle as, "retrieve the
documents that are the best renderers of the query". 1 

The work presented in this paper is partly motivated by 
the following hypothesis: documents that are the best ren- 
derers of a query may be good alternate renditions of it. 
Indeed, a basic premise behind query-expansion techniques 
utilizing pseudo-feedback is that top-retrieved documents 
may reveal dimensions of the user's information need that 
are not obvious from the original (short) query 

EU. 2 As- 
suming for now that the hypothesis is true (we discuss it fur- 
ther below), we therefore propose a type of pseudo-feedback 
approach in which query "expansion" consists of wholesale 
replacement of q with a list of pseudo-queries consisting of 
the query's best renderers. 

Pseudo-queries are clearly a form of pseudo-feedback. How- 
ever, the former term suggests that once we have created 
pseudo-queries from the initial query, we can in principle 
repeat the process, this time seeking the top renderers of the 
pseudo-queries. And if the pseudo-queries are indeed more 
informative than their predecessor(s), then we expect this 
repetition to improve the retrieval results. We thus arrive at 
an iterative boot-strapping approach in which the previously- 
retrieved best renderers become the pseudo-queries for the 
next round. 

Unfortunately, pseudo-feedback quality can suffer from 
problems with both precision and "aspect recall". The currently
unavoidable phenomenon of non-relevant documents
appearing in the retrieval results leads to query drift, "the
alteration of the focus of a search topic caused by improper
expansion" [21]. 3 As for recall, key aspects of the user's
information need may be completely missing from the pool 
of top-retrieved documents, due to both small pool size (in 
order to keep precision reasonable) and selection for docu- 



1 Our choice of terminology ("renderers" rather than "generators")
reflects the fact that we do not assume that documents
(or their induced language models) are the source
that "generates" q: A monkey randomly striking typewriter 
keys may produce a word-for-word copy of Hamlet, but we 
do not therefore say that it is the author of the play. 
2 This is another motivation behind our terminology: in the 
hands of a skilled artist, a rendition of a particular piece 
may be faithful to the original in many respects, and yet 
still be superior overall. 

3 We are not referring to "query drift" in the sense of user 
interests changing over time PQ. 



ments most resembling the short — and hence potentially 
not completely informative — initial query. The inability to 
cope with this missing-aspect problem is viewed as a major 
failing of current systems |HllU|. 

To increase aspect recall, we use the structure of the cor- 
pus, as manifested through language models built on doc- 
ument clusters, to suggest and represent potential facets of 
the user's needs [13, 20]. In particular, we consider finding
good renderers among a set of clusters rather than the
set of documents: a cluster that is a good renderer may 
contain documents that, while relevant, superficially don't 
match the query string precisely because they include as- 
pects not immediately evident in q. (Such documents could 
be present in the cluster by dint of being similar to other 
relevant documents with respect to non-query terms.) 

As for query drift, the problem would seem to be exacer- 
bated by the multiple iterations performed by our algorithm, 
since poor-quality input early in the pipeline has a poten- 
tially disastrous effect on retrieval results later on. To cope 
with this difficulty, we provide a number of methods that 
"re-anchor" pseudo-queries to the original query. 

Experiments with several large corpora reveal significant 
improvements in both average precision and recall over the 
standard language-modeling approach (which corresponds 
to a degenerate version of our methods). This finding sug- 
gests that pseudo-queries may indeed be better than the 
original query as a basis for retrieval. Moreover, compar- 
isons against two highly effective techniques incorporating 
pseudo-relevance feedback (Rocchio on pseudo-feedback
and Lavrenko and Croft's language-model-based relevance
model [18]) show that our cluster-based methods can often
provide competitive or superior performance.

2. RETRIEVAL FRAMEWORK 

We now present a suite of fully automatic iterative algo- 
rithms for processing pseudo-queries. As mentioned above, 
all our algorithms conform to the same general format: find 
the best renderers of a current set of pseudo-queries; then,
repeat the process using these best renderers as the new
pseudo-queries. 

We begin by establishing some notation and conventions
(Section 2.1). Then, we discuss the two main axes along
which our algorithms vary. The first such axis, detailed in
Section 2.2, is the basic definition of a good renderer; the
options we consider are: (1) a document that is a good ren- 
derer of at least one pseudo-query; (2) a document that is 
a good renderer of multiple pseudo-queries; and (3) a clus- 
ter that is a good renderer of multiple pseudo-queries. The 
second main axis of algorithm variation, discussed in Section 2.3,
is the choice of mechanism for preventing query
drift. One idea we pursue is to incorporate the rendition 
probability assigned to the original query. 

2.1 Notation and Conventions 

Throughout this section, $\mathcal{D}$ denotes a given document set,
which induces a fixed vocabulary. The notation q* indicates 
the user's initial query, and N stands for the number of doc- 
uments to be returned in response to q* when the retrieval 
process terminates. Lower-case Greek letters indicate algo- 
rithm parameters that were varied in our experiments. 

We use C to refer to a set of clusters of the documents 
in $\mathcal{D}$ (in our work, C is computed prior to retrieval time).
We freely switch between thinking of a cluster as a subset
of $\mathcal{D}$ and thinking of it as the single text string created by
concatenating its constituent documents in some pre-defined 
order 4 — this allows us to treat queries, documents, and 
clusters uniformly as sequences of terms. 

The algorithms we present engage in iterative processing
of pseudo-queries. In what follows, we assume that in each of
the $\rho$ rounds, the pseudo-query input consists of a ranked list
$Q = q_1, q_2, \ldots$ together with a weight function $w : Q \to [0, 1]$
such that $w(q_i) \ge w(q_{i+1})$. In the first iteration, $Q = q^*$,
and we set $w(q^*)$ to be 1. In subsequent iterations, Q is an
ordering of the documents in $\mathcal{D}$.
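
The round structure just described can be sketched in a few lines of Python. This is only an illustrative skeleton, not the authors' implementation: the function name `iterate_pseudo_queries` and the pluggable `score_round` callback are hypothetical, documents are assumed to carry integer labels, and the re-scoring steps used against query drift are left out.

```python
from typing import Callable, Dict, Hashable, List


def iterate_pseudo_queries(
    q_star: Hashable,
    docs: List[int],
    rounds: int,
    score_round: Callable[[List[Hashable], Dict[Hashable, float]], Dict[int, float]],
) -> List[int]:
    """Run a fixed number of rounds and return the final document ordering."""
    queries: List[Hashable] = [q_star]              # first round: Q is just q*
    weights: Dict[Hashable, float] = {q_star: 1.0}  # and w(q*) = 1
    for _ in range(rounds):
        scores = score_round(queries, weights)
        # Next round: all documents, ordered by score (lower label wins ties),
        # each weighted by the score it has just received.
        queries = sorted(docs, key=lambda d: (-scores.get(d, 0.0), d))
        weights = {d: scores.get(d, 0.0) for d in docs}
    return list(queries)
```

Any of the scoring methods of Section 2.2 could in principle be passed in as `score_round`; documents that receive no score are treated here as having score 0.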

A key concept in our work is that of a renderer r's reper- 
toire among a set of text strings, by which we mean the sub- 
set of the strings that r is a top renderer of. We therefore 
make the following definitions. Let $p_r(x)$ denote an estimate
of the probability that r (in our work, either a document or 
a cluster) renders the text sequence x. 

Definition 1. Let x be a text sequence, and let R be a
finite set of potential renderers. Then, x's top k renderers
in R, denoted TopRen(x; R, k), is the set of k items $r \in R$
that yield the highest $p_r(x)$ (see footnote 5 for tie-breaking).

Definition 2. For renderer r, set of text strings X, set
of potential renderers $R \ni r$, and positive integer k, we define
the repertoire of r in X with respect to R and k as

$$\mathrm{Rep}(r; X \mid R, k) \stackrel{\mathrm{def}}{=} \{x \in X : r \in \mathrm{TopRen}(x; R, k)\}.$$

For compactness, we suppress X , R, and/or k in our nota- 
tion when no confusion can result. 

Note that repertoires are sets, not sorted lists; thus, when 
we say that a pseudo-query $q \in \mathrm{Rep}(r)$ is "highly ranked",
we mean that it occurs early in Q, as opposed to, say, that
its rendition probability $p_r(q)$ is large with respect to the
other members of Rep(r). 
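
Definitions 1 and 2 translate almost directly into code. The sketch below is a hypothetical illustration: `p(r, x)` stands for any estimate of the rendition probability, renderers are assumed to be integer-labelled, and ties are broken in favor of the lower label, as footnote 5 stipulates.

```python
from typing import Callable, Hashable, Iterable, Set


def top_renderers(
    x: Hashable,
    candidates: Iterable[int],
    k: int,
    p: Callable[[int, Hashable], float],
) -> Set[int]:
    """TopRen(x; R, k): the k candidates with the highest p(r, x)."""
    ranked = sorted(candidates, key=lambda r: (-p(r, x), r))
    return set(ranked[:k])


def repertoire(
    r: int,
    strings: Iterable[Hashable],
    candidates: Iterable[int],
    k: int,
    p: Callable[[int, Hashable], float],
) -> Set[Hashable]:
    """Rep(r; X | R, k): the strings in X for which r is a top-k renderer."""
    pool = list(candidates)
    return {x for x in strings if r in top_renderers(x, pool, k, p)}
```

Recomputing the top-k set for every string is wasteful; a real system would presumably cache these sets, as the remarks on off-line computation in Section 2.2 suggest.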

2.2 Basic Methods for Scoring Renderers 

We now present three basic options for determining the 
best renderers of a given iteration's pseudo-queries. Each 
such method M takes as input the ranked list of pseudo-queries
Q and the associated pseudo-query weights $w(q_i)$
and produces a score $\mathrm{Score}_M(d)$ for each document d. The
input to the next round can then be created by setting
$w(d) = \mathrm{Score}_M(d)$ and sorting all the documents in $\mathcal{D}$ by
this quantity, unless additional mechanisms for coping with
query drift are applied (see Section 2.3).

We begin by restricting our consideration of possible renderers
to documents. The Viterbi Doc-Audition scoring
method is a straightforward procedure that ranks those documents
with repertoires containing a highly-weighted pseudo-query
above those that are top renderers only of lower-weighted
ones. Specifically, in each round, it first returns
the top $\tau$ renderers of $q_1$, then the top $\tau$ renderers of $q_2$ (repeated
renderers discarded), and so on, with the list of top
renderers d of each $q_i$ sorted in descending order of $p_d(q_i)$ (see footnote 6).
Hence, suppose we have two top renderers of $q_1$, d and d',

4 In our work, the concatenation order is irrelevant since we 
use unigram language models. 

5 Throughout this paper, we assume that the corpus docu- 
ments and clusters are identified by numeric labels, with ties 
broken in favor of lower-numbered items. 
6 Although it does not convey more insight than the English 



such that $p_d(q_1) > p_{d'}(q_1)$. Then d would be ranked above
d' even if $p_d(q_i) < p_{d'}(q_i)$ for every other $q_i$.
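
A minimal sketch of the ordering that Viterbi Doc-Audition produces within one round, under the assumption that the pseudo-queries arrive already sorted by descending weight. The names `viterbi_doc_audition`, `tau`, and `p` are illustrative choices, not identifiers from the paper.

```python
from typing import Callable, Hashable, List, Sequence


def viterbi_doc_audition(
    pseudo_queries: Sequence[Hashable],
    docs: Sequence[int],
    tau: int,
    p: Callable[[int, Hashable], float],
) -> List[int]:
    """Concatenate the top-tau renderers of each pseudo-query, in query order."""
    ranking: List[int] = []
    seen = set()
    for q in pseudo_queries:
        # Top renderers of q, best first; the lower label wins ties.
        top = sorted(docs, key=lambda d: (-p(d, q), d))[:tau]
        for d in top:
            if d not in seen:  # repeated renderers are discarded
                seen.add(d)
                ranking.append(d)
    return ranking
```

Documents that are a top renderer of no pseudo-query do not appear in the returned list; they would rank below everything in it.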

If we were confident that $q_1$ is indeed the best representation
of the user's information need, then such behavior would
not be unreasonable. In practice, however, such confidence may
not be warranted; rather, we might consider a document
that renders many of the pseudo-queries to be potentially
as good a candidate for retrieval as, or a better one than, a document
that renders only one. Hence, we define an alternative scoring
method, Doc-Audition. In essence, it rewards a potential
renderer $d \in \mathcal{D}$ for every pseudo-query in its repertoire,
although less credit is assigned for low-weight $q_i$ and for $q_i$
that d assigns a low rendition probability to. Specifically,
ranks are induced by the following function:



$$\mathrm{Score}_{\mathrm{Doc}}(d) \stackrel{\mathrm{def}}{=} \sum_{q \in \mathrm{Rep}(d \mid \tau)} w(q) \cdot \frac{p_d(q)}{K(q; m)}, \qquad (1)$$



where the re-scaling term $K(q; m) = \sum_{d' \in \mathrm{TopRen}(q; \mathcal{D}, m)} p_{d'}(q)$
serves to compare $p_d(q)$ to the rendition probabilities of the
top m renderers of q (see footnote 7). Observe that in the first round, only
the top renderers of the original query can receive non-zero
scores.
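
Equation (1) can be read as a simple accumulation loop. The following sketch assumes that `weights` maps each pseudo-query to w(q) and that `p(d, q)` returns the rendition probability; the function and parameter names are hypothetical.

```python
from typing import Callable, Dict, Hashable, Sequence


def doc_audition_scores(
    pseudo_queries: Sequence[Hashable],
    weights: Dict[Hashable, float],
    docs: Sequence[int],
    tau: int,
    m: int,
    p: Callable[[int, Hashable], float],
) -> Dict[int, float]:
    """Credit each document for every pseudo-query it is a top-tau renderer of."""
    scores = {d: 0.0 for d in docs}
    for q in pseudo_queries:
        ranked = sorted(docs, key=lambda d: (-p(d, q), d))
        # K(q; m): total rendition probability of the top-m renderers of q.
        k_qm = sum(p(d, q) for d in ranked[:m])
        if k_qm == 0.0:
            continue  # no document renders q at all
        for d in ranked[:tau]:
            scores[d] += weights[q] * p(d, q) / k_qm
    return scores
```

With a single pseudo-query of weight 1, only its top `tau` renderers end up with non-zero scores, which matches the observation about the first round.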

Like all pseudo-feedback techniques, both scoring meth- 
ods just presented can help ameliorate the crucial "aspect 
recall" problem. We expect that the most improvement rel- 
ative to retrieval based directly on the original query q* 
will occur when the pseudo-queries contain effective search 
terms not appearing in q*, and among the best renderers
of these pseudo-queries are relevant documents containing
the missing terms but not having high overlap with q*. Our
third basic scoring method, Cluster-Audition, makes use
of document clusters to take more explicit advantage of such
situations. In particular, we choose renderers of the pseudo-queries
from the set C of document clusters rather than from
the set of documents. Ideally, the best renderers would be
clusters consisting only of relevant documents (thus increasing
recall), and some of these clusters would represent information
whose importance is implied by the query but not
explicitly mentioned by it (thus increasing "aspect recall"). 
In any event, integrating clusters into language-model-based 
retrieval has recently been shown to yield substantial performance
improvements [13, 20].

While there are a huge number of clustering methods to 
choose from, some quite sophisticated, we simply create one 
cluster for every document by grouping together that document's
top $\gamma$ renderers. This method is convenient for us
since we already need to compute top renderers; moreover,



description just given, for the sake of completeness, here is a
formulation of the Viterbi Doc-Audition scoring method in
terms of an explicit score function. For a given document $d \in \mathcal{D}$,
let $q^+$ be the highest-ranked pseudo-query q in $\mathrm{Rep}(d \mid \tau)$,
and let $i^+$ be $q^+$'s rank. (If d is not a top renderer of any
pseudo-query, we set $i^+$ to $|\mathcal{D}| + 1$ and $q^+$ to the dummy
value $q^*$.) Then, we define

$$\mathrm{Score}_{\mathrm{VDoc}}(d) \stackrel{\mathrm{def}}{=} \frac{p_d(q^+) + 2(|\mathcal{D}| - i^+ + 1)}{1 + 2|\mathcal{D}|}.$$

7 In our experiments, we fixed m to a value guaranteed to be 
greater than $\tau$ to handle cases where more than $\tau$ documents
had high rendition probabilities for q. However, preliminary
experiments indicated that choosing $m = \tau$ did not substantially
alter results.



it has proven effective in previous work, perhaps because 
the highly overlapping clusters can be seen as representing 
different facets of the similarity structure of the corpus [13].
Indeed, there is a history of successful applications of the 
general nearest-neighbor approach (e.g., [5]). 

Within each iteration, Cluster-Audition scoring consists
of two phases. In the first, each cluster c is credited for
every pseudo-query in its repertoire. Note that in computing
repertoires, we chose to restrict the set of possible top
renderers of a pseudo-query q to C(q), the set of clusters
containing q: a cluster is only "allowed" to render its constituent
documents (see footnote 8). (For the sake of readability, we suppress
this restriction in the repertoire notation below.) We
thus have:



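$$\mathrm{Score}_{\mathrm{Clust}}(c) \stackrel{\mathrm{def}}{=} \sum_{q \in \mathrm{Rep}(c \mid \tau)} w(q) \cdot \frac{p_c(q)}{K_1(q)}, \qquad (2)$$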



where $K_1(q) = \sum_{c' \in C(q)} p_{c'}(q)$ re-scales c's rendition probability
with respect to the set of clusters containing q.

The purpose of the second phase of scoring is to convert 
the implicit cluster ranking just computed into a document 
ranking, since the output of each round should be a legal set 
of final retrieval results. This is achieved by crediting each 
document for every cluster it is one of the top $\sigma$ renderers
of, with the restriction (again suppressed in the repertoire 
notation below to enhance readability) that a document may 
only render a cluster that it belongs to: 



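$$\mathrm{Score}_{\mathrm{Clust}}(d) \stackrel{\mathrm{def}}{=} \sum_{c \in \mathrm{Rep}(d \mid \sigma)} \mathrm{Score}_{\mathrm{Clust}}(c) \cdot \frac{p_d(c)}{K_2(c)}, \qquad (3)$$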



where $K_2(c) = \sum_{d' \in c} p_{d'}(c)$ re-scales d's rendition probability
with respect to the set of documents within c.
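
The two phases of Cluster-Audition can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: clusters are given as a hypothetical mapping from integer cluster labels to sets of document labels, `p_cluster(c, q)` and `p_doc(d, c)` stand for the two rendition-probability estimates, and the flag `query_in_all_clusters` models the first-round convention of footnote 8.

```python
from typing import Callable, Dict, Hashable, Sequence, Set


def cluster_audition_scores(
    pseudo_queries: Sequence[Hashable],
    weights: Dict[Hashable, float],
    clusters: Dict[int, Set[int]],
    tau: int,
    sigma: int,
    p_cluster: Callable[[int, Hashable], float],
    p_doc: Callable[[int, int], float],
    query_in_all_clusters: bool = False,
) -> Dict[int, float]:
    """Phase 1 scores clusters; phase 2 passes cluster scores on to documents."""
    cluster_scores = {c: 0.0 for c in clusters}
    for q in pseudo_queries:
        if weights[q] == 0.0:
            continue
        # C(q): only clusters containing q may render it.
        containing = [c for c, members in clusters.items()
                      if query_in_all_clusters or q in members]
        k1 = sum(p_cluster(c, q) for c in containing)
        if k1 == 0.0:
            continue
        top = sorted(containing, key=lambda c: (-p_cluster(c, q), c))[:tau]
        for c in top:
            cluster_scores[c] += weights[q] * p_cluster(c, q) / k1

    doc_scores: Dict[int, float] = {}
    for c, members in clusters.items():
        if cluster_scores[c] == 0.0:
            continue
        # A document may only render a cluster it belongs to.
        k2 = sum(p_doc(d, c) for d in members)
        if k2 == 0.0:
            continue
        top = sorted(members, key=lambda d: (-p_doc(d, c), d))[:sigma]
        for d in top:
            doc_scores[d] = doc_scores.get(d, 0.0) + cluster_scores[c] * p_doc(d, c) / k2
    return doc_scores
```

Documents absent from the returned dictionary received no credit in this round.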

Remarks. If the desired values of $\tau$ and $\sigma$ (the parameters
controlling the sizes of the top-renderer sets) and $\gamma$
(the parameter for cluster size) are known beforehand, then
the clusters and top-renderer sets can be computed off-line,
greatly reducing the amount of computation required at retrieval
time. However, even if these parameters are not
pre-specified, one can pre-compute a ranking of all possible
renderers of each document; this still results in significant
computational savings at run-time.

We note that the iterative processes based upon the lat- 
ter two of the basic scoring schemes we have just described 
can be conceptualized as a fixed-length random walk on 
a graph corresponding to a Markov chain whose structure 
is determined by top-renderer relationships. In fact, the 
Cluster-Audition scoring method is reminiscent of the term-to-document
Markov chain used by Lafferty and Zhai [15]
for query expansion. However, in our case it is not clear 
what insight is gained by such a formulation; for instance, 
how any stationary distribution should be interpreted in the 
context of the retrieval task at hand is not obvious. 

Finally, notice that all three methods just outlined contain 
the standard language-modeling approach [51] as a degen- 
erate one-iteration (or half-iteration, for Cluster-Audition) 
case where $\tau = |\mathcal{D}|$, and for Cluster-Audition, the cluster set
C corresponds to a partition of $\mathcal{D}$ into single-element sets.



8 In the first iteration only, we treat the original query as a
document belonging to all clusters. 



2.3 Coping with Query Drift 

We have previously mentioned that one of our main in- 
terests is increasing so-called "aspect recall". Naturally, 
we want to simultaneously retain high precision as well. 
However, a potential drawback of our iterative approach to 
pseudo-query processing is that engaging in multiple rounds 
threatens to exacerbate query drift: early contamination of 
the set of pseudo-queries with non-relevant documents can 
seriously skew downstream pseudo-query sets away from the 
user's true information needs. Using clusters adds even more 
risk of overgeneralization. We therefore propose a number of 
methods for addressing the query-drift problem. All follow 
the same general strategy: ensure that information from the 
original query q* plays a large role. 

Two indirect techniques all our algorithms employ involve
choosing appropriate values for certain parameters. First,
we limit $\rho$, the total number of rounds, to a relatively small
number. Second, since the initial iteration is the one that is
"closest" to the original query, we give it privileged status
by considering the top $\tau_1$, rather than $\tau$, renderers of q*.

We also consider a number of re-scoring techniques; these 
directly use $p_d(q^*)$ in some combination with the output of
one of the three basic scoring methods M introduced above. 
Recall that without re-scoring, the pseudo-query weights 
w(d) for round t + 1 would simply be the score assigned 
by M to document d at the end of round t. 

We borrow two re-scoring techniques from research on 
cluster-based retrieval within the language-modeling framework
[13, 20]. Both affect only the output of the final round,
since they were introduced in the context of non-iterative
methods. The P* Interpolation technique derives a new
final score for each document d by linear interpolation of
$\mathrm{Score}_M(d)$, d's score in the final round according to M,
with the rendition probability that d assigns to the original
query (after rescaling both quantities with respect to their
maximum values to ensure comparability):

$$\lambda \, \mathrm{Score}_M(d) + (1 - \lambda) \, p_d(q^*).$$

It thus integrates our iterated estimate of document rele- 
vance with surface document-query similarity. 

In contrast, the Truncated P* Re-rank method first
discards all but the top N documents according to $\mathrm{Score}_M(d)$;
each remaining document d is then given the new score
$p_d(q^*)$. Note that this method does not affect recall, since
only the order of the retrieved results is changed.

Alternatively, we could alter the scores at the end of ev- 
ery round, rather than just the final one, as an attempt to 
counteract query drift early in the process. One idea, imple- 
mented in the Iterated Truncation technique, is to con- 
sider only the top N documents in a given iteration or pass 
to be likely to be informative pseudo-queries for the next 
round; the scores of all the other documents are therefore 
zeroed. The Iterated Truncated P* Re-rank technique 
goes even further by additionally changing the scores of the 
top N documents to $p_d(q^*)$. That is, the Truncated P* Re-rank
technique is applied to each round, rather than just
to the final one. Similarly, we can apply P* Interpolation 
at each round, thus yielding the Iterated P* Interpola- 
tion technique. Note that this method is more conservative 
than the original P* Interpolation technique because it tends 
to prevent pseudo-queries with low surface similarity to the 
query from being assigned high scores in early rounds. 
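
The re-scoring techniques of this section are simple post-processing steps on a score table. The sketch below shows one possible reading of them; `scores` maps documents to the output of a basic scoring method, `p_q_star` maps documents to their rendition probability for the original query, and all function names are hypothetical. The iterated variants would apply the same functions at the end of every round instead of only the last.

```python
from typing import Dict, List


def _top(scores: Dict[int, float], n: int) -> List[int]:
    """The n highest-scoring documents; the lower label wins ties."""
    return sorted(scores, key=lambda d: (-scores[d], d))[:n]


def p_star_interpolation(
    scores: Dict[int, float], p_q_star: Dict[int, float], lam: float
) -> Dict[int, float]:
    """Interpolate max-rescaled scores with max-rescaled p_d(q*)."""
    max_s = max(scores.values()) or 1.0    # avoid dividing by zero
    max_p = max(p_q_star.values()) or 1.0
    return {d: lam * scores[d] / max_s + (1.0 - lam) * p_q_star[d] / max_p
            for d in scores}


def truncated_p_star_rerank(
    scores: Dict[int, float], p_q_star: Dict[int, float], n: int
) -> Dict[int, float]:
    """Keep the top n documents and re-score them by p_d(q*)."""
    return {d: p_q_star[d] for d in _top(scores, n)}


def iterated_truncation(scores: Dict[int, float], n: int) -> Dict[int, float]:
    """Zero the scores of all documents outside the top n."""
    keep = set(_top(scores, n))
    return {d: (s if d in keep else 0.0) for d, s in scores.items()}
```

Because `truncated_p_star_rerank` returns the same set of documents that the truncation selected, it changes only their order, which is why recall is unaffected.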



2.4 Estimating Rendition Probabilities 

Rendition probabilities are the foundation upon which all 
our algorithms are built. To describe the method by which 
we estimate them, we first introduce some preliminary con- 
cepts. Let y be either a text string or a set of text strings. 
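Denoting the number of times a term w occurs in y by
$\mathrm{tf}(w \in y)$, for an n-term text sequence $w_1 w_2 \cdots w_n$ we define

$$p_y^{MLE}(w_1 w_2 \cdots w_n) \stackrel{\mathrm{def}}{=} \prod_{j=1}^{n} \frac{\mathrm{tf}(w_j \in y)}{\sum_{w'} \mathrm{tf}(w' \in y)};$$

this is commonly known as y's maximum likelihood estimate
(MLE) for the sequence. The Dirichlet-smoothed version of
the MLE is defined as

$$p_y^{[\mu]}(w_1 w_2 \cdots w_n) \stackrel{\mathrm{def}}{=} \prod_{j=1}^{n} \frac{\mathrm{tf}(w_j \in y) + \mu \cdot p_{\mathcal{D}}^{MLE}(w_j)}{\sum_{w'} \mathrm{tf}(w' \in y) + \mu},$$

where the smoothing parameter $\mu$ controls the degree of reliance
on relative frequencies in the corpus rather than on
the counts in y.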

While the Dirichlet-smoothed unigram language model 
just defined has been used directly |32l 121 II . we adopt the 
following variant: for renderer r and text sequence x, we set



$$p_r(x) = \exp\left(-D\left(p_x^{MLE}(\cdot) \,\middle\|\, p_r^{[\mu]}(\cdot)\right)\right),$$



where D is the KL divergence, which has formed the ba- 
sis for other ranking principles as well |3L)I 1151 \TI\ I13| : the 
two arguments to D are treated as distributions over terms 
rather than term sequences; and the omitted factor (which 
is of independent interest in other contexts [14]) drops out in
the re-scaling performed by the Doc-Audition and Cluster-Audition
scoring methods. Our formulation provides some
mathematical justification for Lavrenko et al.'s "heuristic
adjustment" [17], proposed to handle underflow problems in
processing long documents, to take the geometric mean of
the per-term probabilities in $p_r^{[\mu]}(x)$ rather than $p_r^{[\mu]}(x)$ itself.
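
As a rough illustration of the estimate described in this section, the sketch below computes exp(-KL) between the MLE unigram model of x and the Dirichlet-smoothed unigram model of r. It assumes whitespace-tokenized term lists, a precomputed corpus MLE passed in as a dictionary, and mu > 0; the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Sequence


def mle(tokens: Sequence[str]) -> Dict[str, float]:
    """Unigram maximum likelihood estimate: relative term frequencies."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: c / total for w, c in counts.items()} if total else {}


def rendition_probability(
    x: Sequence[str],
    r: Sequence[str],
    corpus_mle: Dict[str, float],
    mu: float = 2000.0,
) -> float:
    """exp(-KL(MLE of x || Dirichlet-smoothed model of r)) over terms."""
    p_x = mle(x)
    counts = Counter(r)
    length = len(r)
    kl = 0.0
    for w, px in p_x.items():
        # Dirichlet smoothing: mix the counts in r with corpus frequencies.
        pr = (counts[w] + mu * corpus_mle.get(w, 0.0)) / (length + mu)
        if pr == 0.0:
            return 0.0  # x uses a term that is outside the corpus vocabulary
        kl += px * math.log(px / pr)
    return math.exp(-kl)
```

Only terms with non-zero probability under the MLE of x contribute to the divergence, so the loop runs over the distinct terms of x rather than the whole vocabulary.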

3. RELATED WORK 

The fundamental principle underlying pseudo-feedback- 
based methods is that the top-ranked documents retrieved 
in response to a query may contain additional information 
regarding the user's information need. The canonical ap- 
proach is to treat the pseudo-feedback documents as if they 
had actually been deemed relevant by the user, and then 
apply relevance-feedback techniques [26] to them. One such
line of work is to use the feedback documents to re-weight 
query terms and/or to identify additional terms with which 
to augment the query; since our wholesale replacement of the 
query with pseudo-queries can be considered an extension of 
this idea, in Section 4 we compare against one well-known
instantiation, namely, Rocchio [25]. Within the language-modeling
retrieval framework, treating the feedback documents
as relevant often means estimating rendition probabilities
using the feedback pool (where the members may be
differentially weighted) as data 1161 IT51 1311 129| . 

Lafferty and Zhai [15] proposed an iterative probability-estimation
sub-routine that alternates between terms and
documents, which is reminiscent of the shifting between clusters
and documents that our Cluster-Audition algorithm
represents; but their intended application is not a direct 



scoring of potential retrieval candidates. Lavrenko and Croft's 
relevance model algorithm [18] has the same goal as ours, is
also based on language models, and posts state-of-the-art
performance; Section 4 describes it in more detail and reports
the results of our experimental comparisons against
it. 

Query drift has long been recognized as a key concern 
for pseudo- feedback approaches Ej] , and hence a num- 
ber of coping techniques have been previously introduced. 
One example is the application of boolean filters and term 
co-occurrence analysis [21]. The techniques we adopted focus
instead on incorporating rendition probabilities for the
original query, borrowing from previous work [3111201 H^) . 

4. EXPERIMENTS 

To examine the effectiveness of our algorithms and to de- 
termine how much various aspects of our proposed retrieval 
framework contribute, we designed a number of evaluation 
experiments. 

First, we compare the performance of our algorithms to 
that of a language-model-based approach (henceforth baseline)
in which documents are ranked according to $p_d(q^*)$.
This comparison serves not only to see whether our meth- 
ods can outperform an effective retrieval system, but also 
to highlight the merits (or lack thereof) of engaging in mul- 
tiple iterations of pseudo-query processing, since, as noted 
above, conceptually the basic language-modeling approach 
corresponds to a single round or pass of our algorithms. 

We also test our techniques for pseudo-query processing
against a well-known and highly effective pseudo-feedback
method, the Rocchio algorithm [25] as applied to top-retrieved
documents.

Finally, we study whether our particular ways of utilizing
language models are beneficial by testing how well they
perform against the relevance model [18, 19]. The latter approach
takes a generative perspective: assuming that there
is a single relevance language model $\mathcal{R}$ underlying the creation
of both q* and the documents relevant to q*, documents
are ranked by their degree of "match" with $\mathcal{R}$, rather
than by how well they directly match the query or set of
pseudo-queries. In implementation, Lavrenko and Croft estimate
$\mathcal{R}$ by combining the language models of those documents
assigning the highest rendition probabilities to q*.
Thus, the relevance model, similarly to our algorithms, is
a pseudo-feedback-based language-modeling approach, but
clearly the specific way in which document-based language
models are used is quite different from the ways our algorithms
employ them, and Lavrenko and Croft made no explicit
mention of clusters.

Although our reference comparison models operate in different
spaces (vector space vs. the probability simplex), in
[19] it is observed that if i.i.d. sampling is used for
relevance model estimation, then both the pseudo-feedback
version of Rocchio and the relevance model utilize a linear
combination of the top-retrieved documents' models to construct
an expanded query model.

We conducted our experiments on the following three cor- 
pora, drawn from TREC data: 

corpus     # of docs   queries        # of relevant docs
AP89       84,678      1-46, 48-50    3261
AP88+89    164,597     101-150        4805
LA+FR      187,526     401-450        1391



The AP89 corpus was pre-processed with the Porter stemmer.
For AP88+89 the Krovetz stemmer was used, and
both INQUERY stopwords [3] and length-one tokens were
removed to comply with the processing policy in [18]. For
LA+FR, which is part of the TREC-8 corpus, neither stemming
nor stopword removal was applied. It is relatively heterogeneous,
and the LA dataset with TREC-8 queries is considered
to be difficult [11].

For queries, we used the titles of TREC topics rather than 
the full descriptions, resulting in short queries containing 2-5 
terms on average. 

We use both average non-interpolated precision and recall
at N = 1000 as our evaluation measures. Statistically significant
differences in performance are determined using the
two-sided Wilcoxon test at the 95% confidence level.
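
For concreteness, the two evaluation measures can be computed per query as sketched below. This follows the usual TREC-style definition of non-interpolated average precision (precision at each relevant retrieved document, divided by the total number of relevant documents), which is assumed here rather than stated in the text; the function names are illustrative, and the significance test is not shown.

```python
from typing import Sequence, Set


def average_precision(ranked: Sequence[int], relevant: Set[int], n: int = 1000) -> float:
    """Non-interpolated average precision over the top n results of one query."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked[:n], start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank  # precision at this relevant document
    return total / len(relevant)


def recall_at_n(ranked: Sequence[int], relevant: Set[int], n: int = 1000) -> float:
    """Fraction of the relevant documents that appear in the top n results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:n]) & relevant) / len(relevant)
```

Corpus-level figures would then be obtained by averaging these per-query values over all queries.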

4.1 Implementation 

We employed the Lemur toolkit [23] for a number of our
experiments. To collect pseudo-feedback, we used $p_d(q)$ to
create an initial ranking. All parameters were set to values
optimizing average non-interpolated precision.

Our implementation of Rocchio used the vector-space model 
with log tf.idf term weighting to represent queries and docu- 
ments. Similarity was measured via the inner product. The 
free parameters were: (i) $\tau_1$, the number of top-retrieved
documents used for feedback, (ii) the number of terms to 
augment the original query with, and (iii) the weighting co- 
efficient for the augmenting terms (we only used positive 
feedback). Note that while the number of (augmenting) 
terms is not modeled in Rocchio's original method, we var- 
ied it to obtain better performance and comply with the 
optimization steps we implemented for the relevance model 
(details further below). Our implementation yielded accura- 
cies consistent with previously reported (optimized) results. 

Our Lemur-based implementation of the relevance model
utilized i.i.d. sampling (following [19]) to construct $\mathcal{R}$; the divergence
$D(\mathcal{R} \,\|\, p_d(\cdot))$ served as ranking criterion. The free
parameters were $\tau_1$ (the number of top-retrieved documents)
and an interpolation parameter controlling the evaluation of
the top-retrieved documents' language models.

We also experimented with clipping the relevance model 
to assign non-zero probability to only a restricted num- 
ber of terms (up to a maximum of several hundred). This 
modification can be viewed as regulating the degree of query 
expansion, or as an efficiency-improving heuristic. 

Some of our algorithms have quite a few free parameters.
To help prevent our algorithms from enjoying an unfair advantage
due to this fact alone, we implemented the following
policies. The language models forming the basis of both
our methods and the baseline had the Dirichlet smoothing
parameter value fixed at $\mu = 2000$, following [32]. All parameters
shared by our methods and the relevance models
(e.g., $\tau_1$) were set identically for all the algorithms, with the
following search ranges:

$\tau$, the number of top renderers considered: {5, 10, 20, ..., 100}
for Viterbi Doc-Audition; {5, 10, 20, 30, 40} for Doc-Audition;
{1, 2, 3, 4} for Cluster-Audition.

$\tau_1$, the number of best renderers retrieved at the first iteration:
{5} ∪ {10, 20, 100} ∪ {200, 300, 400, 500}.

$\sigma$, the number of documents to which a cluster's score
is distributed (Equation 3): {5, 10, 20, 30, 40} for AP89 and
AP88+89; {5, 10} for LA+FR.

$\gamma$, cluster size: 40 for AP89 and AP88+89; 10 for LA+FR.
21.72%  55.20%  28.28%*  59.67%

Int-Doc    24.48%*  53.77%*  30.28%*  75.57%*  22.98%*  56.00%*
Clust      23.43%*  63.57%*  28.76%*  80.15%*  23.24%*  56.79%
Int-Clust  24.56%*  62.77%*  31.09%*  76.15%*  23.40%   54.57%*

Table 1: Comparison against the baseline. Statistically significant differences with the baseline are marked 
with a star (*). Bold: best performance for each setting (column). Italics: results superior to the baseline. 
Rocchio          22.85%  58.23%  30.69%  76.02%  18.21%  52.26%
RelModel         24.72%  58.08%  32.72%  81.60%  22.03%  46.59%
ClippedRelModel  26.17%  63.48%  32.51%  82.10%  22.34%  56.65%



Table 2: Comparison of the best pseudo-query algorithms against Rocchio, the relevance model, and the 
clipped relevance model. Statistically significant differences with these algorithms are marked with 77, 1Z, and 
c, respectively. Bold: best performance for each setting (column). 



ρ, the number of rounds: 1-2 for Cluster-Audition; 1-5 for
Viterbi Doc-Audition and Doc-Audition.
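
To make the size of these search ranges concrete, the following Python
sketch enumerates the settings for the number of top renderers
considered (tau), the number of renderers retrieved at the first
iteration (tau_1), and the number of rounds (rho). It assumes an
exhaustive Cartesian sweep, which the text does not state, reads the
first-iteration range as 5, 10, 20, ..., 100, 200, ..., 500 (as in the
caption of Figure 1), and omits the cluster-specific parameters; all
identifiers are hypothetical.

```python
from itertools import product

# tau_1: number of best renderers retrieved at the first iteration.
TAU_1 = [5] + list(range(10, 101, 10)) + list(range(200, 501, 100))

# tau: number of top renderers considered, per algorithm.
TAU = {
    "VDoc": [5] + list(range(10, 101, 10)),
    "Doc": [5, 10, 20, 30, 40],
    "Clust": [1, 2, 3, 4],
}

# rho: number of rounds, per algorithm.
RHO = {"VDoc": range(1, 6), "Doc": range(1, 6), "Clust": range(1, 3)}


def grid(method):
    """Yield every (tau, tau_1, rho) setting for the given algorithm."""
    for tau, tau_1, rho in product(TAU[method], TAU_1, RHO[method]):
        yield {"tau": tau, "tau_1": tau_1, "rho": rho}


n_doc = sum(1 for _ in grid("Doc"))
```

Under these assumptions the grids stay in the hundreds of settings per
algorithm, which is what "moderate" search ranges amounts to in
practice.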

4.2 Main results 

We report results using the following abbreviations. 

VDoc    Viterbi Doc-Audition
Doc     Doc-Audition
Clust   Cluster-Audition

The prefix "Int-" indicates that the (non-iterated) P* In- 
terpolation technique was employed for coping with query 
drift; in all other cases, Truncated P* Re-rank was applied. 
Results for other query-drift amelioration mechanisms are 
reported later in this section. 
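
For readers who want a concrete picture of the two mechanisms named
above, the Python sketch below shows one generic reading of each:
linear interpolation of original-query scores with pseudo-query scores,
and truncation of the pseudo-query ranking followed by re-ranking on the
original-query score. The precise definitions are those given earlier in
the paper and may differ; the function names, the weight lam, the
cutoff n, and the toy scores are all assumptions of this sketch.

```python
def interpolate(original, feedback, lam):
    """Mix the two score maps; lam is the weight on the original-query score."""
    docs = set(original) | set(feedback)
    return {
        d: lam * original.get(d, 0.0) + (1 - lam) * feedback.get(d, 0.0)
        for d in docs
    }


def truncated_rerank(original, feedback, n):
    """Keep the n best documents by feedback score, then order them by
    their original-query score."""
    kept = sorted(feedback, key=lambda d: (-feedback[d], d))[:n]
    return sorted(kept, key=lambda d: (-original.get(d, 0.0), d))


# Toy scores (invented for illustration only).
original = {"d1": 0.5, "d2": 0.1, "d3": 0.4}
feedback = {"d1": 0.1, "d2": 0.6, "d3": 0.3}

mixed = interpolate(original, feedback, lam=0.5)
reranked = truncated_rerank(original, feedback, n=2)
```

Both variants anchor the final ranking to the original query, which is
the common idea behind coping with query drift.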

Table 1 compares our algorithms' performance to that of
the language-model baseline. We see that almost all our
methods outperform the language model in both average
precision and recall, often to a statistically significant
degree. Moreover, it is clear that our cluster-based algorithm
Cluster-Audition, in either its Truncated P* Re-rank or P*
Interpolation version, yields the best results, outperforming
not only the language-modeling approach but all the
non-cluster-based algorithms we have proposed. This finding
further reinforces conclusions previously drawn in the
literature regarding the advantages of using clusters to represent
cross-document contextual information [13, 20]; and the fact
that we see especially large improvements in recall provides
partial support for our hypothesis that clusters can
potentially alleviate the problem of aspect recall.
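
Since the comparison rests on average precision and recall, a short
Python sketch of the standard (non-interpolated) definitions may be
useful. It assumes a ranked list of document identifiers and a set of
relevant identifiers for a single query; the optional cutoff and the toy
identifiers are assumptions of this sketch, and the evaluation depth
actually used in the experiments is not restated here.

```python
def average_precision(ranking, relevant):
    """Mean of the precision values at the ranks of the relevant documents."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def recall(ranking, relevant, cutoff=None):
    """Fraction of the relevant documents found in the (truncated) ranking."""
    retrieved = ranking if cutoff is None else ranking[:cutoff]
    return len(set(retrieved) & relevant) / len(relevant) if relevant else 0.0


# Toy query: two of the three relevant documents are retrieved.
ranking = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d9"}
ap = average_precision(ranking, relevant)
rec = recall(ranking, relevant)
```

Relevant documents that are never retrieved contribute zero to average
precision, so a method can raise recall without raising average
precision if the newly found documents sit low in the ranking.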

Moving on to our second main comparison, Table 2 shows
that our cluster-based methods usually yield results that
are better (sometimes to a significant degree) than those of
Rocchio. Comparing our algorithms' performance to that
of the two versions of the relevance model, we observe the
following: on AP89, our cluster-based methods yield results
that are, in a statistical sense, indistinguishable from those
of the (clipped) relevance model; on LA+FR our methods
tend to be superior to the (clipped) relevance model
(sometimes significantly so); but on AP88+89, the (clipped)
relevance model generally performs significantly better than
our cluster-based methods. In interpreting these results,
though, it is crucial to note that (1) our implementation of
the (clipped) relevance model involved an extremely
wide-ranging search over the parameter space, whereas, as
discussed in Section 4.1, our methods only explored moderate
parameter-setting ranges; and (2) clipping is a heuristic that
could potentially be adapted for use by our document- or
cluster-based language models.

While some preliminary results indicate that the
performance of our methods can be further improved by more
exhaustive parameter tuning, we believe that the main
message of Table 2 is that we can achieve performance
competitive with optimized state-of-the-art pseudo-feedback
methods with relatively little optimization effort.

4.3 Further analysis 

Examination of the average-precision results in Table 1
reveals that the P* Interpolation technique is usually more
effective at coping with query drift than Truncated P*
Re-rank. Table 3 provides a more extensive comparison of the
full set of query-drift-prevention techniques we have
proposed. For simplicity, we report only the results of applying
these methods in conjunction with the Doc-Audition
algorithm. It is apparent that most of our techniques achieve
comparable or better precision than is obtained by the
original method by itself, and that P* Interpolation is the
"winner", although the more conservative iterated version
(Iterated P* Interpolation) ties it on two corpora.
Investigation into the optimal parameter settings revealed that, while
performance for Doc-Audition itself was optimized at a low
number of iterations (which would have the effect of
keeping precision from degrading), performance when prevention
techniques were applied was best for a larger number of
iterations, enabling an increase in recall along with preservation
of high precision rates.

                                AP89                AP88+89             LA+FR
                                prec.     recall    prec.     recall    prec.     recall
Iterated Truncation             19.71%    51.09%    26.86%    63.37%    16.84%    31.63%
Iterated Truncated P* Re-rank   22.76%    58.02%    27.96%    68.05%    22.49%    59.45%
Iterated P* Interpolation       24.27%    59.09%    30.28%    75.57%    22.98%    56.00%

Table 3: Comparison of techniques for query drift prevention, using the Doc-Audition scoring method. Bold: best
performance for a given evaluation setting (column). Note that Truncated P* Re-rank improves recall over
the basic algorithm (none) as it achieves optimal precision with a different parameter setting.



[Figure 1 plots. Panel titles: "Recall&Precision wrt number of initially retrieved docs" (top row) and
"Recall&Precision wrt number of initially retrieved docs/clusters" (bottom row), with one panel each for
AP89, AP88+89, and LA+FR. Legend entries include Int-Clust, RelModel, and Clipped RelModel.]

Figure 1: Average precision vs. recall as τ1, the number of initially retrieved best renderers for the original
query, takes the values 5, 10, 20, ..., 100, 200, ..., 500. Top and bottom rows show the best non-cluster- and
cluster-based algorithms, respectively, against the (clipped) relevance model (note the relatively severe
degradation in precision, "leftwards motion", of the latter). To show detail, the plots are not to the same scale.



Finally, we explored a well-known weakness of
language-model-based approaches using pseudo-feedback: sensitivity
to the number τ1 of documents initially retrieved [28]. First,
we see from Figure 1 that the precision of our novel
algorithms is much less affected by increases in τ1 than the
precision of the (clipped) relevance model, which indicates the
merits of both of our query-drift prevention techniques
(although clearly P* Interpolation almost always outperforms
Truncated P* Re-rank at all values of τ1). It is also
interesting to observe that increasing τ1 tends to have a positive
influence on the recall of our cluster-based methods (which,
after all, were posited to improve aspect recall), whereas
eventually it has a negligible or negative influence on the
(clipped) relevance model's recall. In short, the performance
curves for our algorithms tend to move vertically, whereas
the (clipped) relevance model's curves seem to exhibit more
horizontal movement. These trends suggest that our
methods and the relevance model have complementary strengths.
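
The vertical-versus-horizontal contrast can be quantified with a small
Python sketch. It assumes each curve is given as a list of (precision,
recall) points ordered by increasing number of initially retrieved
documents, with precision on the horizontal axis as in Figure 1. The
function name and the two toy curves are invented for illustration and
are not read off the plots.

```python
def movement(points):
    """Return (horizontal, vertical) path length of a curve given as
    (precision, recall) points in sweep order."""
    steps = list(zip(points, points[1:]))
    horizontal = sum(abs(p2 - p1) for (p1, _), (p2, _) in steps)
    vertical = sum(abs(r2 - r1) for (_, r1), (_, r2) in steps)
    return horizontal, vertical


# Toy curves (invented values): one gains recall at stable precision,
# the other loses precision at roughly stable recall.
mostly_vertical = [(0.30, 0.50), (0.30, 0.55), (0.29, 0.60)]
mostly_horizontal = [(0.30, 0.50), (0.27, 0.51), (0.24, 0.50)]

h1, v1 = movement(mostly_vertical)
h2, v2 = movement(mostly_horizontal)
```

A curve with vertical length well above its horizontal length gains
recall while holding precision, which is the pattern described above for
the cluster-based methods.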



5. CONCLUSIONS 

We presented a novel iterative pseudo-feedback approach
to ad hoc information retrieval using cluster-based language
models. Starting from the original query, our methods
repeatedly seek potentially good renderers of a current set of
pseudo-queries, guided by the hypothesis that documents
that are the best renderers of a pseudo-query may be good
alternate renditions of it.
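
The iterative idea summarized above can be sketched in a few lines of
Python. This is a deliberately simplified illustration, not the scoring
used by our algorithms: it assumes unigram document models with
Dirichlet smoothing (mu = 2000), scores a pseudo-query by its
log-likelihood under each document model, takes the union of the top-tau
renderers as the next set of pseudo-queries, and excludes a document
from rendering itself. All names and the toy corpus are hypothetical.

```python
import math
from collections import Counter

MU = 2000  # Dirichlet smoothing parameter


def make_scorer(corpus, mu=MU):
    """Return log_p(text, doc_id): log-likelihood of text under the
    Dirichlet-smoothed unigram model of document doc_id."""
    counts = {d: Counter(tokens) for d, tokens in corpus.items()}
    coll = Counter()
    for c in counts.values():
        coll.update(c)
    coll_len = sum(coll.values())

    def log_p(text, doc_id):
        c, doc_len = counts[doc_id], len(corpus[doc_id])
        # Terms unseen in the whole collection are skipped (a simplification).
        return sum(
            math.log((c[w] + mu * coll[w] / coll_len) / (doc_len + mu))
            for w in text
            if coll[w] > 0
        )

    return log_p


def iterate(query, corpus, tau, rounds, mu=MU):
    """Run the render-and-replace loop; return the final pseudo-query ids."""
    log_p = make_scorer(corpus, mu)
    pseudo_queries = [(None, query)]  # the original query has no document id
    for _ in range(rounds):
        renderers = set()
        for pq_id, text in pseudo_queries:
            candidates = [d for d in corpus if d != pq_id]
            candidates.sort(key=lambda d: (-log_p(text, d), d))
            renderers.update(candidates[:tau])
        pseudo_queries = [(d, corpus[d]) for d in sorted(renderers)]
    return [d for d, _ in pseudo_queries]


corpus = {
    "d1": "oil spill tanker coast".split(),
    "d2": "tanker coast cleanup crew".split(),
    "d3": "stock market prices fall".split(),
}
first_round = iterate(["oil", "spill"], corpus, tau=1, rounds=1)
second_round = iterate(["oil", "spill"], corpus, tau=1, rounds=2)
```

In the toy corpus the second round reaches d2, which shares no term with
the original query, illustrating both the potential recall gain and the
risk of query drift that motivates the prevention techniques.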

One of the major challenges facing today's retrieval
engines is the problem of "aspect recall". To alleviate this
problem, we proposed to take advantage of corpus
structure via the consideration of cluster-based language
models as potential renderers; the key idea is that clusters can
serve as a rich source of information regarding corpus
aspects. Likewise, we examined several techniques for
reducing query drift, which is yet another obstacle that both
traditional and language-modeling-based pseudo-feedback
approaches need to overcome. As evidence that our techniques
are effective, we saw that our algorithms showed significant
improvements in performance with respect to a standard
language-modeling approach, and produced results rivaling
those of other state-of-the-art pseudo-feedback methods.

For future work, we plan to look into analyzing our
methods in real-feedback settings, e.g., [271 1121 E]. Furthermore,
we would like to incorporate and examine additional
clustering approaches for modeling corpus structure.